{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "2b677d5a",
   "metadata": {},
   "source": [
    "# t-test\n",
    "## theory\n",
    "    > main target: find mean differences bwtween two classes. Null hypothesis: no significant difference"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "457e68a2",
   "metadata": {},
   "source": [
    "## scipy implementation\n",
    "```doc\n",
    "ss.ttest_ind(\n",
    "    a,\n",
    "    b,\n",
    "    axis=0,\n",
    "    equal_var=True,\n",
    "    nan_policy='propagate',\n",
    "    permutations=None,\n",
    "    random_state=None,\n",
    "    alternative='two-sided',\n",
    "    trim=0,\n",
    ")\n",
    "Docstring:\n",
    "Calculate the T-test for the means of *two independent* samples of scores.\n",
    "\n",
    "This is a two-sided test for the null hypothesis that 2 independent samples\n",
    "have identical average (expected) values. This test assumes that the\n",
    "populations have identical variances by default.\n",
    "\n",
    "Parameters\n",
    "----------\n",
    "a, b : array_like\n",
    "    The arrays must have the same shape, except in the dimension\n",
    "    corresponding to `axis` (the first, by default).\n",
    "axis : int or None, optional\n",
    "    Axis along which to compute test. If None, compute over the whole\n",
    "    arrays, `a`, and `b`.\n",
    "equal_var : bool, optional\n",
    "    If True (default), perform a standard independent 2 sample test\n",
    "    that assumes equal population variances [1]_.\n",
    "    If False, perform Welch's t-test, which does not assume equal\n",
    "    population variance [2]_.\n",
    "\n",
    "nan_policy : {'propagate', 'raise', 'omit'}, optional\n",
    "    Defines how to handle when input contains nan.\n",
    "    The following options are available (default is 'propagate'):\n",
    "\n",
    "      * 'propagate': returns nan\n",
    "      * 'raise': throws an error\n",
    "      * 'omit': performs the calculations ignoring nan values\n",
    "\n",
    "    The 'omit' option is not currently available for permutation tests or\n",
    "    one-sided asympyotic tests.\n",
    "\n",
    "permutations : non-negative int, np.inf, or None (default), optional\n",
    "    If 0 or None (default), use the t-distribution to calculate p-values.\n",
    "    Otherwise, `permutations` is  the number of random permutations that\n",
    "    will be used to estimate p-values using a permutation test. If\n",
    "    `permutations` equals or exceeds the number of distinct partitions of\n",
    "    the pooled data, an exact test is performed instead (i.e. each\n",
    "    distinct partition is used exactly once). See Notes for details.\n",
    "\n",
    "\n",
    "random_state : {None, int, `numpy.random.Generator`,\n",
    "        `numpy.random.RandomState`}, optional\n",
    "\n",
    "    If `seed` is None (or `np.random`), the `numpy.random.RandomState`\n",
    "    singleton is used.\n",
    "    If `seed` is an int, a new ``RandomState`` instance is used,\n",
    "    seeded with `seed`.\n",
    "    If `seed` is already a ``Generator`` or ``RandomState`` instance then\n",
    "    that instance is used.\n",
    "\n",
    "    Pseudorandom number generator state used to generate permutations\n",
    "    (used only when `permutations` is not None).\n",
    "\n",
    "alternative : {'two-sided', 'less', 'greater'}, optional\n",
    "    Defines the alternative hypothesis.\n",
    "    The following options are available (default is 'two-sided'):\n",
    "\n",
    "      * 'two-sided'\n",
    "      * 'less': one-sided\n",
    "      * 'greater': one-sided\n",
    "\n",
    "trim : float, optional\n",
    "    If nonzero, performs a trimmed (Yuen's) t-test.\n",
    "    Defines the fraction of elements to be trimmed from each end of the\n",
    "    input samples. If 0 (default), no elements will be trimmed from either\n",
    "    side. The number of trimmed elements from each tail is the floor of the\n",
    "    trim times the number of elements. Valid range is [0, .5).\n",
    "\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "178e3232",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.datasets import load_iris\n",
    "import scipy.stats as ss\n",
    "import numpy as np\n",
    "\n",
    "data_iris = load_iris()\n",
    "data_iris.keys()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "d91cafd6",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)']"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "iris_f = data_iris[\"feature_names\"]\n",
    "iris_data = data_iris[\"data\"][:,:3]\n",
    "iris_target = data_iris.target\n",
    "iris_f[:3]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "15a255a6",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(0, 50), (1, 50), (2, 50)]"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from collections import Counter\n",
    "\n",
    "c = Counter(iris_target)\n",
    "c.most_common()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "5ef801c0",
   "metadata": {},
   "outputs": [],
   "source": [
    "class0 = []\n",
    "class1 = []\n",
    "class2 = []\n",
    "for idx,(feature, target) in enumerate(zip(iris_data,iris_target)):\n",
    "    if target == 0:\n",
    "        class0.append(feature)\n",
    "    elif target == 1:\n",
    "        class1.append(feature)\n",
    "    else:\n",
    "        class2.append(feature)\n",
    "class0 = np.array(class0, dtype=np.float64)\n",
    "class1 = np.array(class1, dtype=np.float64)\n",
    "class2 = np.array(class2, dtype=np.float64)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "234cef40",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Ttest_indResult(statistic=array([-10.52098627,   9.45497585, -39.49271939]), pvalue=array([3.74674261e-17, 2.48422790e-15, 9.93443296e-46]))"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# t-test for mean between class-0 and class-1\n",
    "ss.ttest_ind(class0,class1,axis=0,equal_var=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "007a10af",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Ttest_indResult(statistic=array([ -5.62916526,  -3.20576075, -12.60377944]), pvalue=array([1.86614439e-07, 1.81948348e-03, 4.90028753e-22]))"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# t-test for mean between class-1 and class-2\n",
    "ss.ttest_ind(class1,class2,axis=0,equal_var=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "adc8e4a6",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Ttest_indResult(statistic=array([-15.38619582,   6.45034909, -49.98618626]), pvalue=array([3.96686727e-25, 4.57077142e-09, 9.26962759e-50]))"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# t-test for mean between class-0 and class-2\n",
    "ss.ttest_ind(class0,class2,axis=0,equal_var=False)"
   ]
  },
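  {
   "cell_type": "markdown",
   "id": "c7e41d90",
   "metadata": {},
   "source": [
    "With `equal_var=False`, as in the three calls above, `ttest_ind` performs Welch's t-test. As a reference sketch (the symbols $\\bar{x}_k$, $s_k^2$ and $n_k$ are notation introduced here for the sample mean, unbiased sample variance and sample size of class $k$; they are not names from the scipy docstring), the statistic computed for each feature is\n",
    "$$\n",
    "    t = \\frac{\\bar{x}_1 - \\bar{x}_2}{\\sqrt{\\frac{s_1^2}{n_1} + \\frac{s_2^2}{n_2}}}\n",
    "$$\n",
    "and the p-value is taken from a t-distribution whose degrees of freedom follow the Welch-Satterthwaite approximation\n",
    "$$\n",
    "    \\nu \\approx \\frac{{\\left(\\frac{s_1^2}{n_1} + \\frac{s_2^2}{n_2}\\right)}^2}{\\frac{{(s_1^2/n_1)}^2}{n_1 - 1} + \\frac{{(s_2^2/n_2)}^2}{n_2 - 1}}\n",
    "$$\n",
    "A large $|t|$ means the gap between the class means is large relative to its standard error. In the outputs above, every p-value in all three comparisons falls below the usual 0.05 level, so the null hypothesis of equal means is rejected for each feature."
   ]
  },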
  {
   "cell_type": "markdown",
   "id": "3e9924f9",
   "metadata": {},
   "source": [
    "## notification\n",
    "> 1. due to limitation of dimensionality, t-test will only work on 1-d data, which means that it could only calculate p value one feature at a time.  <br>\n",
    "> 2. This means we assume that features are independent to each other. Instead of modeling all features as one single compound probability density function, we simply divided each variable under the assumption that they are independent to each other, and compound pdf are the product of each marginal pdf"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3a9384db",
   "metadata": {},
   "source": [
    "# scatter matrix and FDR\n",
    "> this definition will use variance(second order central distance) and mean(first order original distance)\n",
    "<br>\n",
    "> within-class scatter matrix\n",
    "$$\n",
    "    \\mathbf{S_{\\omega}} = \\displaystyle\\sum_{i=1}^{M}{p_i\\Sigma_i}\n",
    "$$\n",
    "where\n",
    "$p_i = \\frac{n_i}{N}$ is the proportion of data of certain class to the whole data set, and $M$ is the total number of classes <br>\n",
    "and $\\Sigma_i = E[(x-\\mu_i){(x-\\mu_i)}^T]$ is the covariance matrix of ith-class\n",
    "\n",
    "> between-class scatter matrix \n",
    "$$\n",
    "   \\mathbf{S_b} = \\displaystyle\\sum_{i=1}^{M}{p_i(\\mu_i-\\mu_0){(\\mu_i-\\mu_0)}^T}\n",
    "$$\n",
    "where\n",
    "$\\mu_0 = \\displaystyle\\sum_{i=1}^{M}{p_i\\mu_i}$ is the global mean matrix <br>\n",
    "\n",
    "> mixture scatter matrix\n",
    "$$\n",
    "    \\mathbf{S_m} = E[(x-\\mu_0){(x-\\mu_0)}^T]\n",
    "$$\n",
    "\n",
    "> we could easily proof\n",
    "$$\n",
    "    \\mathbf{S_m} =  \\mathbf{S_{\\omega}} +  \\mathbf{S_b}\n",
    "$$\n",
    "\n",
    "> as we can see, in two class classification, $| \\mathbf{S_{\\omega}}|$ is proportional to ${\\sigma_1}^2 + {\\sigma_2}^2$ amd $| \\mathbf{S_{b}}|$ is proportional to ${(\\mu_1 - \\mu_2)}^2$ <br>\n",
    "> thus we could define <font color=maroon><b>FDR(Fisher's Discriminant Ratio)</b></font>\n",
    "$$\n",
    "    FDR = \\displaystyle\\sum_{i}^{M}\\displaystyle\\sum_{j \\neq i}^{M}\\frac{{(\\mu_1 - \\mu_2)}^2}{{\\sigma_1}^2 + {\\sigma_2}^2}\n",
    "$$\n",
    "\n",
    "which could be easier to memo\n",
    "$$\n",
    "    FDR \\sim \\frac{|\\mathbf{S_b}|}{|\\mathbf{S_{\\omega}}|}\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "11a19b0a",
   "metadata": {},
   "source": [
    "## LDA(Linear Discriminant Analysis)\n",
    "> one simple example with FDR <br>\n",
    "> implementation: sklearn discriminant_analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f1937e2b",
   "metadata": {},
   "source": [
    "```doc\n",
    "LinearDiscriminantAnalysis(\n",
    "    solver='svd',\n",
    "    shrinkage=None,\n",
    "    priors=None,\n",
    "    n_components=None,\n",
    "    store_covariance=False,\n",
    "    tol=0.0001,\n",
    "    covariance_estimator=None,\n",
    ")\n",
    "Docstring:     \n",
    "Linear Discriminant Analysis.\n",
    "\n",
    "A classifier with a linear decision boundary, generated by fitting class\n",
    "conditional densities to the data and using Bayes' rule.\n",
    "\n",
    "The model fits a Gaussian density to each class, assuming that all classes\n",
    "share the same covariance matrix.\n",
    "\n",
    "The fitted model can also be used to reduce the dimensionality of the input\n",
    "by projecting it to the most discriminative directions, using the\n",
    "`transform` method.\n",
    "\n",
    "\n",
    "Parameters\n",
    "----------\n",
    "solver : {'svd', 'lsqr', 'eigen'}, default='svd'\n",
    "    Solver to use, possible values:\n",
    "      - 'svd': Singular value decomposition (default).\n",
    "        Does not compute the covariance matrix, therefore this solver is\n",
    "        recommended for data with a large number of features.\n",
    "      - 'lsqr': Least squares solution.\n",
    "        Can be combined with shrinkage or custom covariance estimator.\n",
    "      - 'eigen': Eigenvalue decomposition.\n",
    "        Can be combined with shrinkage or custom covariance estimator.\n",
    "\n",
    "shrinkage : 'auto' or float, default=None\n",
    "    Shrinkage parameter, possible values:\n",
    "      - None: no shrinkage (default).\n",
    "      - 'auto': automatic shrinkage using the Ledoit-Wolf lemma.\n",
    "      - float between 0 and 1: fixed shrinkage parameter.\n",
    "\n",
    "    This should be left to None if `covariance_estimator` is used.\n",
    "    Note that shrinkage works only with 'lsqr' and 'eigen' solvers.\n",
    "\n",
    "priors : array-like of shape (n_classes,), default=None\n",
    "    The class prior probabilities. By default, the class proportions are\n",
    "    inferred from the training data.\n",
    "\n",
    "n_components : int, default=None\n",
    "    Number of components (<= min(n_classes - 1, n_features)) for\n",
    "    dimensionality reduction. If None, will be set to\n",
    "    min(n_classes - 1, n_features). This parameter only affects the\n",
    "    `transform` method.\n",
    "\n",
    "store_covariance : bool, default=False\n",
    "    If True, explicitly compute the weighted within-class covariance\n",
    "    matrix when solver is 'svd'. The matrix is always computed\n",
    "    and stored for the other solvers.\n",
    "\n",
    "tol : float, default=1.0e-4\n",
    "    Absolute threshold for a singular value of X to be considered\n",
    "    significant, used to estimate the rank of X. Dimensions whose\n",
    "    singular values are non-significant are discarded. Only used if\n",
    "    solver is 'svd'.\n",
    "\n",
    "covariance_estimator : covariance estimator, default=None\n",
    "    If not None, `covariance_estimator` is used to estimate\n",
    "    the covariance matrices instead of relying on the empirical\n",
    "    covariance estimator (with potential shrinkage).\n",
    "    The object should have a fit method and a ``covariance_`` attribute\n",
    "    like the estimators in :mod:`sklearn.covariance`.\n",
    "    if None the shrinkage parameter drives the estimate.\n",
    "\n",
    "    This should be left to None if `shrinkage` is used.\n",
    "    Note that `covariance_estimator` works only with 'lsqr' and 'eigen'\n",
    "    solvers.\n",
    "\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "id": "ecae4233",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.discriminant_analysis import LinearDiscriminantAnalysis\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.metrics import  classification_report\n",
    "from sklearn.datasets import load_breast_cancer\n",
    "from collections import Counter\n",
    "import numpy as np\n",
    "import warnings\n",
    "\n",
    "warnings.filterwarnings(\"ignore\")\n",
    "\n",
    "data_cancer = load_breast_cancer()\n",
    "data_cancer.keys()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "id": "0100999e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(569, 30)\n",
      "[(1, 357), (0, 212)]\n"
     ]
    }
   ],
   "source": [
    "cc = Counter(data_cancer.target)\n",
    "print(data_cancer.data.shape)\n",
    "print(cc.most_common(5))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "id": "ba6cf131",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "              precision    recall  f1-score   support\n",
      "\n",
      "           0       0.97      0.91      0.94        43\n",
      "           1       0.95      0.99      0.97        71\n",
      "\n",
      "    accuracy                           0.96       114\n",
      "   macro avg       0.96      0.95      0.95       114\n",
      "weighted avg       0.96      0.96      0.96       114\n",
      "\n"
     ]
    }
   ],
   "source": [
    "cancer_data = data_cancer.data\n",
    "cancer_f = data_cancer.feature_names\n",
    "cancer_target = data_cancer.target\n",
    "cx_train, cx_test, cy_train, cy_test = train_test_split(cancer_data, cancer_target, \n",
    "                                                        test_size=0.2, random_state=42,\n",
    "                                                        shuffle=True)\n",
    "lda_clf = LinearDiscriminantAnalysis(n_components=1)\n",
    "lda_clf.fit(cx_train, cy_train)\n",
    "cy_pred = lda_clf.predict(cx_test)\n",
    "print(classification_report(cy_test, cy_pred))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "id": "427aa7a1",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 3.80341976e+00, -5.39212301e-02, -4.39436335e-01,\n",
       "        -6.34042278e-03,  7.93926949e+00,  9.65027592e+01,\n",
       "        -1.94072071e+01, -9.48433359e+01,  6.52894977e+00,\n",
       "        -1.12179949e+02, -8.34003815e+00,  2.43731481e-01,\n",
       "         1.59124190e-01,  2.40224154e-02, -3.48805766e+02,\n",
       "         4.26007772e+01,  8.24576776e+01, -3.50357849e+02,\n",
       "         2.30465093e+01,  5.81140729e+01, -4.13737425e+00,\n",
       "        -1.85553514e-01,  1.68084303e-01,  1.85463710e-02,\n",
       "        -2.55422830e+00, -1.47149164e+01, -1.18801082e+01,\n",
       "         2.55875979e+01, -1.97016750e+01, -2.45735089e+01]])"
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "lda_clf.coef_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "id": "b300704a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([49.29932077])"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "lda_clf.intercept_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "id": "f735e300",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[1.74169231e+01, 2.14923077e+01, 1.15012959e+02, 9.75013609e+02,\n",
       "        1.02531124e-01, 1.43885148e-01, 1.59455266e-01, 8.67635503e-02,\n",
       "        1.93533136e-01, 6.26227219e-02, 6.00757988e-01, 1.20041598e+00,\n",
       "        4.28259763e+00, 7.18094675e+01, 6.75819527e-03, 3.17857870e-02,\n",
       "        4.18483432e-02, 1.50038935e-02, 2.06236627e-02, 3.97157988e-03,\n",
       "        2.10274556e+01, 2.92200592e+01, 1.40713964e+02, 1.41022781e+03,\n",
       "        1.44440769e-01, 3.71363373e-01, 4.51448994e-01, 1.81149467e-01,\n",
       "        3.26636095e-01, 9.11269822e-02],\n",
       "       [1.21680559e+01, 1.78216434e+01, 7.82140909e+01, 4.64910839e+02,\n",
       "        9.17334615e-02, 7.98258741e-02, 4.72053007e-02, 2.55395140e-02,\n",
       "        1.73751049e-01, 6.28359790e-02, 2.84577273e-01, 1.20402867e+00,\n",
       "        2.01659545e+00, 2.13169266e+01, 7.12550350e-03, 2.20011573e-02,\n",
       "        2.74909122e-02, 1.00562413e-02, 2.05438776e-02, 3.73115490e-03,\n",
       "        1.34032587e+01, 2.33585664e+01, 8.72421678e+01, 5.61890210e+02,\n",
       "        1.23904301e-01, 1.82647238e-01, 1.70089682e-01, 7.46106678e-02,\n",
       "        2.69150350e-01, 7.95783566e-02]])"
      ]
     },
     "execution_count": 53,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "lda_clf.means_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "id": "cb0fcba3",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1.41176352e+01, 1.91850330e+01, 9.18822418e+01, 6.54377582e+02,\n",
       "       9.57440220e-02, 1.03619319e-01, 8.88981451e-02, 4.82798703e-02,\n",
       "       1.81098681e-01, 6.27567692e-02, 4.02015824e-01, 1.20268681e+00,\n",
       "       2.85825341e+00, 4.00712989e+01, 6.98907473e-03, 2.56354484e-02,\n",
       "       3.28236723e-02, 1.18939407e-02, 2.05735121e-02, 3.82045560e-03,\n",
       "       1.62351033e+01, 2.55356923e+01, 1.07103121e+02, 8.76987033e+02,\n",
       "       1.31532132e-01, 2.52741802e-01, 2.74594569e-01, 1.14182222e-01,\n",
       "       2.90502198e-01, 8.38678462e-02])"
      ]
     },
     "execution_count": 54,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "lda_clf.xbar_"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
